
feat(tools): truncate large tool results with workspace spill #489

Open
TatsuKo-Tsukimi wants to merge 3 commits into dataelement:main from TatsuKo-Tsukimi:feat/tool-result-truncation

Conversation

@TatsuKo-Tsukimi

Summary

Tool calls returning ~50KB of search hits or PDF extracts used to enter the in-context history verbatim, burning tokens for content the model samples one slice of. New services/tool_result_truncation.py spills oversized payloads to agent_data/<agent_id>/_tool_results/<call_id>.txt and replaces the in-context body with a head excerpt + a marker pointing the model at read_file for retrieval.

Threshold: 4000 estimated tokens (~12KB at chars/3) — soft-start to give the model time to learn the read_file follow-up; can tighten to ~2000 once telemetry confirms reliable use of the marker.

Smart head:

  • JSON-shape responses with a known array key (results/items/data/entries/hits/documents) keep metadata + first 5 items + a synthetic _truncated_<key>_count field, so search/list responses stay useful
  • Otherwise plain head-cut (preferring a line boundary when one is near the cut point)
  • Vision-injected list payloads (image_data markers) pass through

Sandbox: spill files live under each agent's existing AGENT_DATA_DIR sandbox; cross-agent leak is prevented by the existing read_file boundary.
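The flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `truncate_tool_result`, `TRUNCATION_THRESHOLD_TOKENS`, `ARRAY_KEYS`, and the marker wording are all assumptions inferred from the description.

```python
import json
from pathlib import Path

TRUNCATION_THRESHOLD_TOKENS = 4000
ARRAY_KEYS = ("results", "items", "data", "entries", "hits", "documents")

def estimate_tokens(text: str) -> int:
    # chars/3 deliberately overestimates, so truncation kicks in a little early.
    return len(text) // 3

def truncate_tool_result(content: str, call_id: str, workspace: Path) -> str:
    if estimate_tokens(content) <= TRUNCATION_THRESHOLD_TOKENS:
        return content  # pass-through for small results

    # Spill the full payload to the agent's sandbox.
    spill_dir = workspace / "_tool_results"
    spill_dir.mkdir(parents=True, exist_ok=True)
    spill_path = spill_dir / f"{call_id}.txt"
    spill_path.write_text(content, encoding="utf-8")

    head = _smart_head(content)
    return (f"{head}\n\n[TRUNCATED: full result saved to "
            f"_tool_results/{call_id}.txt; use read_file to retrieve it]")

def _smart_head(content: str, max_chars: int = 2000) -> str:
    # JSON-shape responses with a known array key keep metadata + first 5 items.
    try:
        payload = json.loads(content)
        if isinstance(payload, dict):
            for key in ARRAY_KEYS:
                if isinstance(payload.get(key), list):
                    out = {k: v for k, v in payload.items() if k != key}
                    out[key] = payload[key][:5]
                    out[f"_truncated_{key}_count"] = len(payload[key])
                    return json.dumps(out, ensure_ascii=False)
    except ValueError:
        pass
    # Otherwise a plain head-cut, preferring a nearby line boundary.
    cut = content.rfind("\n", 0, max_chars)
    return content[: cut if cut > max_chars // 2 else max_chars]
```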

Changes

Applied at three sites to prevent drift:

  • llm/caller.py:_process_tool_call (call_llm path)
  • llm/caller.py:call_agent_llm_with_tools inner loop
  • services/heartbeat.py tool loop

agent_context.py rule 3 augmented with a paragraph teaching the model to recognize the marker and use read_file for full content rather than fabricating from prior knowledge.

16 new unit tests (pass-through, spill, utf-8 round-trip, threshold boundary, JSON-shape preservation, write-failure fallback, realistic 50-result jina_search end-to-end).

Stack

PR 3/4 of context-budget hygiene series.

Test plan

  • pytest backend/tests/test_tool_result_truncation.py — 16/16 pass
  • Reviewer: is the 4000-token threshold too aggressive for jina_search returns? Should it be customized per tool?
  • Reviewer: spill files accumulate forever — need a GC story or retention policy follow-up?
  • Reviewer: confirm the sandbox boundary (agent A can't read agent B's _tool_results/) holds under A2A relationship paths

Note: 3 tool loops still exist (caller.py:432, caller.py:763, heartbeat.py:335). Drift prevention here is partial — the next PR consolidates them into a single execute_tool_calls helper.

🤖 Generated with Claude Code

Replace naive `conversation[-ctx_size:]` slicing with a walker that treats
assistant.tool_calls and its matching role="tool" messages as one atomic
block. Naive slicing can leave an orphan role="tool" at the head when the
cut lands mid-pair — OpenAI rejects this with "No tool call found for
function call output" (issue dataelement#446).

New helper `services/history_window.truncate_by_message_count` walks tail
backward, identifies blocks, and includes blocks whole or not at all.
Orphan tools are silently dropped regardless of budget.
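A sketch of the block-aware walker described above; the signature and grouping details are assumptions, not the actual `services/history_window` implementation.

```python
def truncate_by_message_count(messages: list[dict], budget: int) -> list[dict]:
    # Group messages into atomic blocks: an assistant message carrying
    # tool_calls plus its following role="tool" replies form one block.
    blocks, i = [], 0
    while i < len(messages):
        msg = messages[i]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            j = i + 1
            while j < len(messages) and messages[j].get("role") == "tool":
                j += 1
            blocks.append(messages[i:j])
            i = j
        elif msg.get("role") == "tool":
            i += 1  # orphan tool reply: drop silently regardless of budget
        else:
            blocks.append([msg])
            i += 1

    # Walk blocks from the tail; include each block whole or not at all.
    kept, used = [], 0
    for block in reversed(blocks):
        if used + len(block) > budget:
            break
        kept.append(block)
        used += len(block)
    return [m for block in reversed(kept) for m in block]
```

Because a tool_calls/tool pair counts as one unit, a budget cut can never land mid-pair and produce the orphan `role="tool"` that OpenAI rejects.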

Replaces head-only pop guard in:
  - app/api/websocket.py (web chat)
  - app/api/feishu.py (feishu channel)

Leaves app/services/llm/caller.py:626 untouched (call_agent_llm short-reply
path's hardcoded [-10:] is intentional).

Tests: 15 covering empty input, budget edges, parallel tool_calls,
multi-pair, head/mid orphans, realistic 60-message invariant.

Addresses dataelement#446 failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b880d1ff67


```python
        return tool_content

    # Spill full content to disk
    full_path = agent_workspace / "_tool_results" / f"{call_id}.txt"
```

P1: Use a unique spill filename per tool invocation

Writing spill files as _tool_results/{call_id}.txt can overwrite earlier truncated outputs because call_id is not globally unique across rounds for all providers (for example, the Gemini adapters synthesize IDs like call_1, call_2 each response in backend/app/services/llm/client.py). When a later tool call reuses the same ID, an older in-context marker now points to unrelated content, so a subsequent read_file retrieves the wrong payload and can corrupt the agent’s reasoning/output.
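One possible fix, sketched: suffix the provider-supplied `call_id` with a short random token so re-used IDs (such as the synthesized `call_1`, `call_2`) cannot collide across rounds. The helper name is hypothetical.

```python
import uuid

def unique_spill_name(call_id: str) -> str:
    # Keep the call_id prefix for debuggability; the uuid suffix guarantees
    # uniqueness even when a provider reuses IDs across responses.
    return f"{call_id}_{uuid.uuid4().hex[:8]}.txt"
```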


@TatsuKo-Tsukimi force-pushed the feat/tool-result-truncation branch from b880d1f to 1cff15b on April 27, 2026 at 08:45

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cff15b156



```python
try:
    spill_root.mkdir(parents=True, exist_ok=True)
    full_path.write_text(tool_content, encoding="utf-8")
```

P2: Bound growth of spilled tool-result files

Each oversized tool call writes a full payload file under _tool_results/ and there is no retention or cleanup path anywhere in the new truncation flow, so long-running agents that frequently call search/PDF tools will accumulate files without bound until AGENT_DATA_DIR fills up. Once disk pressure hits, subsequent spills fall back to partial excerpts and the model can no longer retrieve full results, so this becomes a production reliability issue rather than just storage overhead.
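A possible follow-up, sketched as a TTL sweep run on each new spill. Everything here is an assumption (the function name, the 24-hour TTL, and the sweep-on-write trigger); the PR itself ships no retention policy.

```python
import time
from pathlib import Path

def sweep_spill_dir(spill_dir: Path, ttl_seconds: int = 24 * 3600) -> int:
    """Remove spill files whose mtime is older than ttl_seconds; return count."""
    if not spill_dir.is_dir():
        return 0
    cutoff = time.time() - ttl_seconds
    removed = 0
    for f in spill_dir.glob("*.txt"):
        if f.stat().st_mtime < cutoff:
            f.unlink(missing_ok=True)
            removed += 1
    return removed
```

A sweep keyed on mtime is deliberately conservative: a file the agent recently re-read via `read_file` keeps its mtime only if reads touch it, so an atime- or registry-based policy could retain hot files longer at the cost of more bookkeeping.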


TatsuKo-Tsukimi and others added 2 commits April 27, 2026 17:51
Add `agents.context_window_tokens` field (default 50000) and a new
`truncate_by_token_budget` helper that bounds in-context history by both
estimated token cost (primary) and message count (safety cap), preserving
assistant.tool_calls ↔ role=tool pairs intact via the same walker as
truncate_by_message_count.

Why: message-count alone is a wildly variable proxy for token cost — one
50KB tool result eats more context than 100 short user messages. Token
budget gives predictable behavior across heterogeneous traffic; message
cap remains as a safety net against pathological tiny-message floods.
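The dual-bound idea can be sketched as below, estimating each message's cost from its JSON-serialized length at chars/3. For brevity this sketch walks single messages; the real helper additionally keeps tool_calls/tool pairs atomic via the shared walker, and the function signature here is an assumption.

```python
import json

def estimate_tokens_from_chars(n_chars: int) -> int:
    # chars/3 overestimates safely for most tokenizers.
    return n_chars // 3

def message_cost(msg: dict) -> int:
    return estimate_tokens_from_chars(len(json.dumps(msg, ensure_ascii=False)))

def truncate_by_token_budget(messages, token_budget, max_messages):
    # Walk from the tail; stop when either bound would be exceeded.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = message_cost(msg)
        if used + cost > token_budget or len(kept) >= max_messages:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```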

Changes:
  - models/agent.py: + context_window_tokens (Integer, default=50000)
    + DEFAULT_CONTEXT_WINDOW_TOKENS constant
  - schemas/schemas.py: AgentOut, AgentUpdate (1000 <= tokens <= 500000)
  - alembic: add_context_window_tokens.py (idempotent IF NOT EXISTS)
  - services/history_window.py: + truncate_by_token_budget, refactored
    common walker, JSON-serialized char->token estimate via existing
    estimate_tokens_from_chars (chars/3 — overestimates safely)
  - api/websocket.py: pass tok_budget to helper, raise DB load to
    max(ctx_size, 500) so helper has room to choose
  - api/feishu.py: same pattern at 2 sites (web chat + IM channel paths)
  - frontend: AgentDetail Settings slider + i18n + types

10 new tests covering token-budget mode (huge-message dropped, both-bounds
interaction, atomic pair preservation, orphan defense). 25/25 pass.

Other channels (dingtalk/discord/slack/teams/wecom/whatsapp) still use
DB-level message-count limit only — they don't get token awareness in
this PR but won't crash. Migrating them is follow-up scope.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tool calls returning ~50 KB of search hits or PDF extracts used to enter
the in-context history verbatim, burning tokens for content the model
samples one slice of. New `services/tool_result_truncation.py` spills
oversized payloads to `agent_data/<agent_id>/_tool_results/<call_id>.txt`
and replaces the in-context body with a head excerpt + a marker pointing
the model at `read_file` for retrieval.

Threshold: 4000 estimated tokens (~12 KB at chars/3) — soft-start to give
the model time to learn the read_file follow-up; can tighten to ~2000
once telemetry confirms reliable use of the marker.

Smart head:
  - JSON-shape responses with a known array key (results/items/data/
    entries/hits/documents) keep metadata + first 5 items + a synthetic
    _truncated_<key>_count field, so search/list responses stay useful.
  - Otherwise plain head-cut (preferring a line boundary when one is
    near the cut point).
  - Vision-injected list payloads (image_data markers) pass through.

Sandbox: spill files live under each agent's existing AGENT_DATA_DIR
sandbox; cross-agent leak is prevented by the existing read_file boundary.

Applied at three sites to prevent drift:
  - llm/caller.py _process_tool_call (call_llm path)
  - llm/caller.py call_agent_llm_with_tools inner loop
  - services/heartbeat.py tool loop

agent_context.py rule 3 augmented with a paragraph teaching the model to
recognize the marker and use read_file for full content rather than
fabricating from prior knowledge.

16 new unit tests covering pass-through, spill, utf-8 round-trip,
threshold boundary, JSON-shape preservation, write-failure fallback, and
a realistic 50-result jina_search end-to-end. 41/41 total pass.

Note: 3 tool loops still exist (caller.py:432, caller.py:763,
heartbeat.py:335). Drift prevention is partial here — PR 3 consolidates
them into a single execute_tool_calls helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@TatsuKo-Tsukimi force-pushed the feat/tool-result-truncation branch from 1cff15b to b01c0f1 on April 27, 2026 at 09:51
